Appl. No. 10/955,586 

Amdt. dated 11/24/2004 

Reply to the Office Action of 05/24/2004 



Amendments to the Claims: 

This listing of claims will replace all prior versions, and listings, of claims in the 
application: 

Listing of Claims: 

Please amend Claims 1, 4, 7-9, 11, 13, 16, and 19-20, and cancel Claims 3 and 15, 
as follows. 

1. (Currently Amended) A method comprising [[the step of]] 

cleaning, by operations of a computer system, a set of text documents to minimize 
violations of a predetermined set of Hypertext Information Retrieval rules by: 

decomposing each page of the set of text documents into one or more pagelets; 

identifying all pagelets belonging to templates; and 

eliminating the template pagelets from a data set, and wherein a template 
comprises a collection of pagelets T satisfying the following two requirements: 
(1) all the pagelets in T are identical or almost identical; and 
(2) every two pages owning pagelets in T are reachable one from the other via 
other pages also owning pagelets in T. 



2. (Original) The method of claim 1, wherein the set of text documents comprises a 
collection of HTML pages. 

3. (Canceled) 

4. (Currently Amended) The method of claim 1, wherein the decomposing step 
comprises the steps of: 

parsing each text document into a parse tree that comprises at least one node; 

traversing the at least one node of the tree; 

determining if one of the at least one node comprises a pagelet; and 
outputting a representation corresponding to the one of the at least one node if it 
comprises a pagelet. 

5. (Original) The method of claim 4, wherein the determining step comprises the 
steps of: 

verifying the node is of a type belonging to a predetermined class of eligible 
types; 

verifying the node contains at least a predetermined number of hyperlinks; and 
verifying none of the node's children are pagelets. 

6. (Original) The method of claim 5, wherein the predetermined class of eligible 
types comprises at least one of tables, lists, paragraphs, image maps, headers, table rows, 
table cells, list items, selection bars, and frames. 
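
The determining step of claims 4 to 6 can be illustrated with a minimal sketch. It is not the applicant's implementation: the `Node` class, the tag names in `ELIGIBLE_TAGS`, the `min_links` threshold, and the choice to count hyperlinks over a node's whole subtree are all assumptions made for illustration. The third test follows the claim wording literally and looks only at a node's direct children.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical HTML tag names standing in for the "eligible types" of claim 6.
ELIGIBLE_TAGS = {"table", "ul", "ol", "p", "map", "h1", "tr", "td", "li", "select", "frame"}

@dataclass
class Node:
    """A parse-tree node: a tag name plus child nodes (illustrative only)."""
    tag: str
    children: list[Node] = field(default_factory=list)

def count_links(node: Node) -> int:
    """Count hyperlink (<a>) nodes in the subtree rooted at node."""
    return int(node.tag == "a") + sum(count_links(c) for c in node.children)

def find_pagelets(root: Node, min_links: int = 2) -> list[Node]:
    """Traverse the tree and collect every node that passes the three tests."""
    pagelets: list[Node] = []

    def visit(node: Node) -> bool:
        # Post-order: decide the children first so the third test can use them.
        child_is_pagelet = [visit(c) for c in node.children]
        is_pagelet = (
            node.tag in ELIGIBLE_TAGS               # eligible type
            and count_links(node) >= min_links      # enough hyperlinks
            and not any(child_is_pagelet)           # no child is a pagelet
        )
        if is_pagelet:
            pagelets.append(node)
        return is_pagelet

    visit(root)
    return pagelets

# A navigation list with three links and a paragraph with one link.
page = Node("body", [
    Node("ul", [Node("li", [Node("a")]) for _ in range(3)]),
    Node("p", [Node("a")]),
])
print([n.tag for n in find_pagelets(page)])
```

With the default threshold of two hyperlinks, only the list qualifies: each list item and the paragraph hold a single link, and `body` is not an eligible type.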

7. (Currently Amended) The method of claim 1, wherein the step of identifying all 
pagelets belonging to templates comprises the steps of: 

calculating a shingle value for each page and for each pagelet in the set of 

documents; 

eliminating identical pagelets belonging to duplicate pages; 
sorting the pagelets by their shingle value into clusters; 
enumerating the clusters; and 

outputting a representation corresponding to the pagelets belonging to each 

cluster. 
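
The steps of claim 7 can be sketched as follows, under stated assumptions. The `shingle` function is a simplification: it fingerprints the full set of w-word windows, so it groups only pagelets whose window sets match exactly, whereas a production shingle would use a min-wise sketch to also catch almost identical text. The input shape (a dict of page id to pagelet texts) and the use of the concatenated pagelet texts as the page content are also assumptions.

```python
import hashlib
from itertools import groupby

def shingle(text: str, w: int = 3) -> str:
    """Simplified shingle value: SHA-1 over the sorted set of w-word windows."""
    words = text.lower().split()
    windows = {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}
    return hashlib.sha1("\n".join(sorted(windows)).encode("utf-8")).hexdigest()

def cluster_pagelets(pages: dict, w: int = 3) -> list:
    """Return clusters of (page_id, pagelet_text) that share a shingle value."""
    records = []
    seen = set()  # (page shingle, pagelet shingle) pairs already recorded
    for page_id, pagelets in pages.items():
        page_value = shingle(" ".join(pagelets), w)
        for text in pagelets:
            key = (page_value, shingle(text, w))
            if key in seen:
                # Same pagelet on a page with the same page shingle:
                # a duplicate page (or a repeat within one page), so skip it.
                continue
            seen.add(key)
            records.append((key[1], page_id, text))
    # Sorting by shingle value makes equal values adjacent; groupby then
    # enumerates the clusters.
    records.sort(key=lambda r: r[0])
    return [[(pid, text) for _, pid, text in group]
            for _, group in groupby(records, key=lambda r: r[0])]

pages = {
    "a": ["home about contact us", "story one about cats today"],
    "b": ["home about contact us", "story two covers dogs instead"],
    "c": ["home about contact us", "story one about cats today"],  # duplicate of "a"
}
for cluster in cluster_pagelets(pages):
    print(cluster)
```

Page "c" is an exact duplicate of page "a", so its pagelets are dropped before clustering; the shared navigation text of "a" and "b" ends up in one cluster of size two.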

8. (Currently Amended) [[The method of claim 3, wherein the step of identifying 
pagelets belonging to templates comprises the steps of:]] 
A method comprising: 

cleaning, by operations of a computer system, a set of text documents to minimize 
violations of a predetermined set of Hypertext Information Retrieval rules by: 

decomposing each page of the set of text documents into one or more pagelets; 
identifying all pagelets belonging to templates; and 

eliminating the template pagelets from a data set, and wherein the identifying 
pagelets belonging to templates comprises: 

calculating a shingle value for each page and for each pagelet in the 

document set; 

sorting the pagelets by their shingle value into clusters; 
selecting all clusters of size greater than 1 ; 

finding for each cluster all hyperlinks between pages owning pagelets in 

that cluster; 

finding for each cluster all undirected connected components of a graph 
induced by the pages owning pagelets in that cluster; and 

outputting a representation corresponding to the components of size 

greater than 1. 
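
The cluster-selection and connected-component steps of claim 8 can be sketched as below. This is an illustration under assumptions, not the claimed implementation: clusters are taken as lists of hypothetical `(page_id, pagelet_id)` pairs, hyperlinks as a set of `(source_page, target_page)` pairs, and component size is measured in pages. A breadth-first search is one standard way to find undirected connected components.

```python
from collections import defaultdict, deque

def find_templates(clusters, links):
    """Return, per qualifying component, the pagelets owned by its pages."""
    templates = []
    for cluster in clusters:
        if len(cluster) <= 1:            # keep only clusters of size > 1
            continue
        pages = {page for page, _ in cluster}
        # Hyperlinks between pages of this cluster, treated as undirected edges.
        adjacent = defaultdict(set)
        for src, dst in links:
            if src in pages and dst in pages and src != dst:
                adjacent[src].add(dst)
                adjacent[dst].add(src)
        # Breadth-first search over the induced graph, one component at a time.
        unvisited = set(pages)
        while unvisited:
            start = unvisited.pop()
            component = {start}
            queue = deque([start])
            while queue:
                page = queue.popleft()
                for neighbour in adjacent[page]:
                    if neighbour in unvisited:
                        unvisited.remove(neighbour)
                        component.add(neighbour)
                        queue.append(neighbour)
            if len(component) > 1:       # output components of size > 1
                templates.append([(p, g) for p, g in cluster if p in component])
    return templates

clusters = [
    [("p1", "nav"), ("p2", "nav"), ("p3", "nav"), ("p9", "nav")],
    [("p1", "body")],
]
links = {("p1", "p2"), ("p3", "p2"), ("p9", "x")}
print(find_templates(clusters, links))
```

In this example pages p1, p2, and p3 are linked to one another and share the "nav" pagelet, so they form one template; p9 carries the same pagelet but has no link to the others and is left out.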

9. (Currently Amended) A system comprising: 
a user interface; 

a user interface/event manager communicatively coupled to the user interface; 
a generic data gathering application; 

a generic information retrieval application, communicatively coupled to the user 
interface/event manager; and 

a data cleaning application, communicatively coupled to the generic data 
gathering application and to the generic information retrieval application, for: 

decomposing each page of a set of text documents into one or more 

pagelets; 

identifying all pagelets belonging to templates; and 

eliminating the template pagelets from a data set, and wherein a template 
comprises a collection of pagelets T satisfying the following two requirements: 

(1) all the pagelets in T are identical or almost identical; and 

(2) every two pages owning pagelets in T are reachable one from the other via 
other pages also owning pagelets in T. 

[[communicatively coupled to the generic data gathering application and to the generic 
information retrieval application;]] 

10. (Original) The system of claim 9, further comprising: 

a pagelet identifier, communicatively coupled to the data cleaning application; 

a hypertext parser, communicatively coupled to the pagelet identifier; 

a template identifier, communicatively coupled to the data cleaning application; 

a Breadth First Search (BFS) algorithm, communicatively coupled to the template 
identifier; and 

a shingle calculator, communicatively coupled to the data cleaning application. 

11. (Currently Amended) An apparatus comprising: 
a user interface; 

a user interface/event manager communicatively coupled to the user interface; 
a generic data gathering application; 

a generic information retrieval application, communicatively coupled to the user 
interface/event manager; and 

a data cleaning application, for: 

decomposing each page of the set of text documents into one or more 

pagelets; 

identifying all pagelets belonging to templates; and 
eliminating the template pagelets from a data set, 

communicatively coupled to the generic data gathering application and to 
the generic information retrieval application and wherein a template comprises a 
collection of pagelets T satisfying the following two requirements: 

(1) all the pagelets in T are identical or almost identical; and 
(2) every two pages owning pagelets in T are reachable one from the other via 
other pages also owning pagelets in T. 

12. (Original) The apparatus of claim 11, further comprising: 

a pagelet identifier, communicatively coupled to the data cleaning application; 

a hypertext parser, communicatively coupled to the pagelet identifier; 

a template identifier, communicatively coupled to the data cleaning application; 

a BFS algorithm, communicatively coupled to the template identifier; and 

a shingle calculator, communicatively coupled to the data cleaning application. 

13. (Currently Amended) A computer readable medium including computer 
instructions for driving a user interface, the computer instructions comprising instructions 
for: 

cleaning, by operations of a computer system, a set of text documents to minimize 
violations of a predetermined set of Hypertext Information Retrieval rules by: 

decomposing each page of the set of text documents into one or more pagelets; 
identifying any pagelets belonging to templates; and 

eliminating the template pagelets from a data set, and wherein a template 
comprises a collection of pagelets T satisfying the following two requirements: 

(1) all the pagelets in T are identical or almost identical; and 

(2) every two pages owning pagelets in T are reachable one from the other via 
other pages also owning pagelets in T. 

14. (Original) The computer readable medium of claim 13, wherein the set of text 
documents comprises a collection of HTML pages. 

15. (Canceled) 

16. (Currently Amended) The computer readable medium of claim 13 [[14]], wherein the 
decomposing step comprises the steps of: 

parsing each text document into a parse tree that comprises at least one node; 
traversing the at least one node of the tree; 

determining if one of the at least one node comprises a pagelet; and 
outputting a representation corresponding to the one of the at least one node if it 
comprises a pagelet. 

17. (Original) The computer readable medium of claim 16, wherein the determining 

step comprises the steps of: 

verifying the node is of a type belonging to a predetermined class of eligible 

types; 

verifying the node contains at least a predetermined number of hyperlinks; and 

verifying none of the node's children are pagelets. 

18. (Original) The computer readable medium of claim 17, wherein the 
predetermined class of eligible types comprises at least one of tables, lists, paragraphs, 
image maps, headers, table rows, table cells, list items, selection bars, and frames. 

19. (Currently Amended) The computer readable medium of claim 13 [[15]], wherein the 
step of identifying pagelets belonging to templates comprises the steps of: 

calculating a shingle value for each page and for each pagelet in the set of 
documents; 

eliminating identical pagelets belonging to duplicate pages; 
sorting the pagelets by their shingle value into clusters; 
enumerating the clusters; and 

outputting a representation corresponding to the pagelets belonging to each 

cluster. 

20. (Currently Amended) [[The computer readable medium of claim 15, wherein the 
step of identifying pagelets belonging to templates comprises the steps of:]] 

A computer readable medium including computer instructions for driving a user 
interface, the computer instructions comprising instructions for: 
cleaning, by operations of a computer system, a set of text documents to minimize 
violations of a predetermined set of Hypertext Information Retrieval rules by: 

decomposing each page of the set of text documents into one or more pagelets; 
identifying any pagelets belonging to templates; and 

eliminating the template pagelets from a data set, and wherein the identifying 
pagelets belonging to templates comprises: 

calculating a shingle value for each page and for each pagelet in the 

document set; 

sorting the pagelets by their shingle value into clusters; 
selecting all clusters of size greater than 1 ; 

finding for each cluster all hyperlinks between pages owning pagelets in 

that cluster; 

finding for each cluster all undirected connected components of a graph 
induced by the pages owning pagelets in that cluster; and 

outputting a representation corresponding to the components of size 

greater than 1. 